Assessment: Gemini Batch Fix & Dataset Preview Row Limiting #820

vprashrex wants to merge 5 commits into
Conversation
📝 Walkthrough

Adds optional dataset preview (headers + first N rows) to GET /datasets/{id} with models, service parsing of CSV/XLSX, docs, and tests; moves the Gemini/Google JSONL row identifier to a top-level field.

Changes
- Assessment dataset preview
- Gemini Batch JSONL Schema
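For context on the schema change: the Gemini Batch API expects each JSONL line to carry the row identifier in a top-level "key" field alongside the "request" body, roughly as sketched below (the field contents and function name are illustrative, not taken from this PR):

```python
import json

def to_batch_jsonl_line(row_id: str, prompt: str) -> str:
    # The row identifier sits at the top level ("key"),
    # not inside the request body.
    record = {
        "key": row_id,
        "request": {"contents": [{"parts": [{"text": prompt}]}]},
    }
    return json.dumps(record)
```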
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/app/api/routes/assessment/datasets.py`:
- Around line 149-161: The truncated flag is over-reported because the code
treats len(rows) >= limit_rows as truncated; to fix, request one extra row from
preview_assessment_dataset (call with limit=limit_rows + 1), set truncated =
len(rows) > limit_rows, and if truncated trim rows to the original limit_rows
before constructing AssessmentDatasetPreview (use the existing names session,
dataset, limit_rows, preview_assessment_dataset, headers, rows, and
AssessmentDatasetPreview).
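The N + 1 pattern this comment describes can be sketched generically; `paginate_preview` and `fetch_rows` below are illustrative names, not the project's actual API:

```python
def paginate_preview(fetch_rows, limit_rows: int):
    """Detect truncation by fetching one extra row, then trimming."""
    # Ask for limit_rows + 1 so a dataset with exactly limit_rows rows
    # is not misreported as truncated.
    rows = fetch_rows(limit_rows + 1)
    truncated = len(rows) > limit_rows
    if truncated:
        rows = rows[:limit_rows]
    return rows, truncated
```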
In `@backend/app/services/assessment/dataset.py`:
- Around line 197-219: The current preview logic defaults to CSV for any
non-".xlsx" file_ext which can silently mis-handle missing/invalid metadata;
update the preview path in the preview function (where file_ext is derived) to
validate file_ext explicitly (normalize with .lower() and strip), and only allow
known extensions like ".xlsx" and ".csv"; if file_ext is None or not in the
allowed set, raise HTTPException(status_code=422, detail="Unsupported or missing
file extension.") instead of calling _preview_csv, otherwise call _preview_excel
for ".xlsx" and _preview_csv for ".csv".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 60a75072-f1b1-49b2-b789-0b3989427bea
📒 Files selected for processing (5)
- backend/app/api/docs/assessment/get_dataset.md
- backend/app/api/routes/assessment/datasets.py
- backend/app/models/assessment.py
- backend/app/services/assessment/dataset.py
- backend/app/tests/assessment/test_batch.py
✅ Files skipped from review due to trivial changes (1)
- backend/app/api/docs/assessment/get_dataset.md
```python
file_ext = (dataset.dataset_metadata or {}).get("file_extension")
if file_ext == ".xls":
    raise HTTPException(
        status_code=422,
        detail="Legacy Excel format (.xls) is not supported.",
    )

storage = get_cloud_storage(session=session, project_id=project_id)
try:
    content = storage.get(dataset.object_store_url)
except Exception as e:
    logger.warning(
        f"[preview_dataset] Failed to fetch file | dataset_id={dataset.id} | {e}",
        exc_info=True,
    )
    raise HTTPException(
        status_code=502, detail="Failed to fetch dataset file from storage."
    ) from e

try:
    if file_ext == ".xlsx":
        return _preview_excel(content, limit)
    return _preview_csv(content, limit)
```
Reject unknown/missing file extensions instead of defaulting to CSV.
Line 219 defaults to CSV parsing for non-.xlsx values. If metadata is missing/incorrect, preview can return garbage instead of a clear 422.
Suggested fix

```diff
-file_ext = (dataset.dataset_metadata or {}).get("file_extension")
+file_ext = ((dataset.dataset_metadata or {}).get("file_extension") or "").lower()
 if file_ext == ".xls":
     raise HTTPException(
         status_code=422,
         detail="Legacy Excel format (.xls) is not supported.",
     )
+if file_ext not in {".csv", ".xlsx"}:
+    raise HTTPException(
+        status_code=422,
+        detail="Unsupported or missing dataset file extension for preview.",
+    )
 ...
 if file_ext == ".xlsx":
     return _preview_excel(content, limit)
 return _preview_csv(content, limit)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
file_ext = ((dataset.dataset_metadata or {}).get("file_extension") or "").lower()
if file_ext == ".xls":
    raise HTTPException(
        status_code=422,
        detail="Legacy Excel format (.xls) is not supported.",
    )
if file_ext not in {".csv", ".xlsx"}:
    raise HTTPException(
        status_code=422,
        detail="Unsupported or missing dataset file extension for preview.",
    )
storage = get_cloud_storage(session=session, project_id=project_id)
try:
    content = storage.get(dataset.object_store_url)
except Exception as e:
    logger.warning(
        f"[preview_dataset] Failed to fetch file | dataset_id={dataset.id} | {e}",
        exc_info=True,
    )
    raise HTTPException(
        status_code=502, detail="Failed to fetch dataset file from storage."
    ) from e
try:
    if file_ext == ".xlsx":
        return _preview_excel(content, limit)
    return _preview_csv(content, limit)
```
Codecov Report: ❌ Patch coverage is …
🧹 Nitpick comments (3)
backend/app/tests/assessment/test_dataset.py (3)
148-163: 💤 Low value: Consider importing openpyxl at module level.

openpyxl is already imported at line 7 for InvalidFileException. Importing it again inside the test function (lines 149, 151) is inconsistent with the module-level import pattern.

♻️ Proposed consolidation

At the top of the file, consolidate the imports:

```diff
 from openpyxl.utils.exceptions import InvalidFileException
+import openpyxl
+import io
```

Then remove the inline imports in the test functions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/tests/assessment/test_dataset.py` around lines 148 - 163, The test function test_preview_excel_returns_headers_and_rows imports openpyxl locally even though openpyxl is already imported at module level for InvalidFileException; remove the inline imports inside test_preview_excel_returns_headers_and_rows and any other tests, and add/ensure a single module-level import for openpyxl alongside InvalidFileException so the test uses that top-level import instead.
142-146: ⚡ Quick win: Strengthen the latin-1 fallback assertion.

The test claims to verify the latin-1 fallback but only checks that the value starts with "ca". It should verify that the invalid UTF-8 byte \xff was correctly decoded as ÿ (U+00FF in latin-1) rather than dropped.

✨ Proposed stronger assertion

```diff
 def test_preview_csv_handles_latin1_fallback(self) -> None:
     # \xff is invalid utf-8 -> falls back to latin-1
     headers, rows = _preview_csv(b"name\nca\xfffe\n", limit=5)
     assert headers == ["name"]
-    assert rows and rows[0][0].startswith("ca")
+    assert rows and rows[0][0] == "caÿfe"
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/tests/assessment/test_dataset.py` around lines 142 - 146, Update the test_preview_csv_handles_latin1_fallback assertion to verify the latin-1 decoded character is present: call _preview_csv as before, then assert that the first cell exactly equals "caÿ" or contains the Unicode character U+00FF (ÿ) to ensure the invalid UTF-8 byte 0xFF was decoded via latin-1; refer to the test function name test_preview_csv_handles_latin1_fallback and the helper _preview_csv when locating the change.
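The fallback under test can be reproduced with a minimal stand-in for `_preview_csv` (a sketch, not the project's implementation):

```python
import csv
import io

def preview_csv(content: bytes, limit: int):
    # Try strict UTF-8 first; latin-1 maps every byte 0x00-0xFF to a
    # code point, so the fallback decode can never raise.
    try:
        text = content.decode("utf-8")
    except UnicodeDecodeError:
        text = content.decode("latin-1")
    table = list(csv.reader(io.StringIO(text)))
    headers = table[0] if table else []
    return headers, table[1 : limit + 1]
```

With the bytes from the test, b"name\nca\xfffe\n", the 0xFF byte decodes to ÿ, so the stronger equality assertion on "caÿfe" holds rather than merely a prefix check.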
165-175: ⚡ Quick win: Clarify expected behavior for empty workbooks.

The assertion at line 174 accepts two different outcomes ([""] or []), which suggests either:
- The expected behavior for empty workbooks is not well-defined, or
- The test is being overly permissive.

Consider determining the correct expected behavior and asserting only that outcome.

♻️ Proposed fix

If empty workbooks should return an empty list:

```diff
-    assert headers == [""] or headers == []
+    assert headers == []
```

Or if they should return a list with one empty string:

```diff
-    assert headers == [""] or headers == []
+    assert headers == [""]
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/tests/assessment/test_dataset.py` around lines 165 - 175, The test test_preview_excel_empty_workbook is ambiguous because it accepts two outcomes for headers; decide the canonical behavior for _preview_excel (either return [] for no headers or [""] to represent a single empty header) and update the test to assert that single expected value only; locate the test function test_preview_excel_empty_workbook and the helper _preview_excel, then change the assertion to assert headers == <chosen_expected_value> (and keep assert rows == []), or if you choose to change _preview_excel instead, make it return the chosen headers shape for an empty workbook and keep the test asserting that one outcome.
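One way to pin the behavior to a single canonical outcome is to normalize an all-empty header row to [] before returning it; `normalize_headers` is a hypothetical helper, not code from this PR:

```python
def normalize_headers(raw_row) -> list[str]:
    # openpyxl yields None for empty cells; collapse an all-empty first
    # row to [] so callers see exactly one shape for "no headers".
    headers = ["" if cell is None else str(cell) for cell in raw_row]
    return [] if all(h == "" for h in headers) else headers
```

With this in place, the test can assert the single outcome `headers == []` for an empty workbook.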
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3ab2e4d5-c58f-4731-ada1-cfe46d7ba28d
📒 Files selected for processing (2)
- backend/app/tests/assessment/test_dataset.py
- backend/app/tests/assessment/test_routes.py
Target issue: #830
Summary
This pull request addresses two key issues within the assessment module: a fix to the Gemini batch JSONL schema (the row identifier now lives at the top level) and row limiting for dataset previews.

Previously, large dataset responses caused frontend browser lag and UI freezes due to excessive data rendering on the client side. By introducing row limiting, the frontend can now request lightweight previews (for example, 5 rows), resulting in improved performance and a smoother user experience.
Checklist
Before submitting a pull request, please ensure that you mark these tasks.

Run `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.

Notes
Please add here if any other information is required for the reviewer.
Summary by CodeRabbit
- New Features
- Documentation
- Refactor
- Bug Fixes
- Tests